Problem Statement¶

Business Context¶

Renewable energy sources play an increasingly important role in the global energy mix, as the effort to reduce the environmental impact of energy production increases.

Out of all the renewable energy alternatives, wind energy is one of the most developed technologies worldwide. The U.S Department of Energy has put together a guide to achieving operational efficiency using predictive maintenance practices.

Predictive maintenance uses sensor information and analysis methods to measure and predict degradation and future component capability. The idea behind predictive maintenance is that failure patterns are predictable and if component failure can be predicted accurately and the component is replaced before it fails, the costs of operation and maintenance will be much lower.

The sensors fitted across different machines involved in the process of energy generation collect data related to various environmental factors (temperature, humidity, wind speed, etc.) and additional features related to various parts of the wind turbine (gearbox, tower, blades, break, etc.).

Objective¶

“ReneWind” is a company working on improving the machinery/processes involved in the production of wind energy using machine learning and has collected data of generator failure of wind turbines using sensors. They have shared a ciphered version of the data, as the data collected through sensors is confidential (the type of data collected varies with companies). Data has 40 predictors, 20000 observations in the training set and 5000 in the test set.

The objective is to build various classification models, tune them, and find the best one that will help identify failures so that the generators could be repaired before failing/breaking to reduce the overall maintenance cost. The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model. These will result in repairing costs.
  • False negatives (FN) are real failures where there is no detection by the model. These will result in replacement costs.
  • False positives (FP) are detections where there is no failure. These will result in inspection costs.

It is given that the cost of repairing a generator is much less than the cost of replacing it, and the cost of inspection is less than the cost of repair.

“1” in the target variables should be considered as “failure” and “0” represents “No failure”.

Data Description¶

  • The data provided is a transformed version of original data which was collected using sensors.
  • Train.csv - To be used for training and tuning of models.
  • Test.csv - To be used only for testing the performance of the final best model.
  • Both the datasets consist of 40 predictor variables and 1 target variable

Importing necessary libraries¶

In [1]:
#for data manipulation
import numpy as np
import pandas as pd

#for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

#for statistics
import scipy.stats as stats

#for imputing (just in case there are missing values in the data)
from sklearn.impute import SimpleImputer

#for dividing training set into validation set and training set
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.model_selection import KFold

#for oversampling and undersampling
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

#for regression models
from sklearn.linear_model import LogisticRegression

#for building different models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier


#for deriving performance metrics of models
from sklearn import metrics
from sklearn.metrics import (precision_score, recall_score, f1_score, accuracy_score, confusion_matrix, roc_auc_score)
from sklearn.metrics import r2_score

#for turning hyperparameters of models
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

#for making pipeline
from sklearn.pipeline import Pipeline

#for ignoring warnings
import warnings
warnings.filterwarnings('ignore')

Loading the dataset¶

In [2]:
#loading the training set data into a dataframe called train
df_train = pd.read_csv('Train.csv.csv')
#loading the testing set data into a dataframe called test
df_test = pd.read_csv('Test.csv.csv')

Data Overview¶

The df_test dataframe is used only for the purpose of testing the models, so the main focus of the data analysis is the df_train dataframe. The df_test dataframe will also be viewed for the purpose of sanity checking and gaining a better understanding of the distribution of data.

In [3]:
#viewing the first five rows of df_train
df_train.head()
Out[3]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -4.464606 -4.679129 3.101546 0.506130 -0.221083 -2.032511 -2.910870 0.050714 -1.522351 3.761892 ... 3.059700 -1.690440 2.846296 2.235198 6.667486 0.443809 -2.369169 2.950578 -3.480324 0
1 3.365912 3.653381 0.909671 -1.367528 0.332016 2.358938 0.732600 -4.332135 0.565695 -0.101080 ... -1.795474 3.032780 -2.467514 1.894599 -2.297780 -1.731048 5.908837 -0.386345 0.616242 0
2 -3.831843 -5.824444 0.634031 -2.418815 -1.773827 1.016824 -2.098941 -3.173204 -2.081860 5.392621 ... -0.257101 0.803550 4.086219 2.292138 5.360850 0.351993 2.940021 3.839160 -4.309402 0
3 1.618098 1.888342 7.046143 -1.147285 0.083080 -1.529780 0.207309 -2.493629 0.344926 2.118578 ... -3.584425 -2.577474 1.363769 0.622714 5.550100 -1.526796 0.138853 3.101430 -1.277378 0
4 -0.111440 3.872488 -3.758361 -2.982897 3.792714 0.544960 0.205433 4.848994 -1.854920 -6.220023 ... 8.265896 6.629213 -10.068689 1.222987 -3.229763 1.686909 -2.163896 -3.644622 6.510338 0

5 rows × 41 columns

In [4]:
#viewing the last five rows of df_train
df_train.tail()
Out[4]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
19995 -2.071318 -1.088279 -0.796174 -3.011720 -2.287540 2.807310 0.481428 0.105171 -0.586599 -2.899398 ... -8.273996 5.745013 0.589014 -0.649988 -3.043174 2.216461 0.608723 0.178193 2.927755 1
19996 2.890264 2.483069 5.643919 0.937053 -1.380870 0.412051 -1.593386 -5.762498 2.150096 0.272302 ... -4.159092 1.181466 -0.742412 5.368979 -0.693028 -1.668971 3.659954 0.819863 -1.987265 0
19997 -3.896979 -3.942407 -0.351364 -2.417462 1.107546 -1.527623 -3.519882 2.054792 -0.233996 -0.357687 ... 7.112162 1.476080 -3.953710 1.855555 5.029209 2.082588 -6.409304 1.477138 -0.874148 0
19998 -3.187322 -10.051662 5.695955 -4.370053 -5.354758 -1.873044 -3.947210 0.679420 -2.389254 5.456756 ... 0.402812 3.163661 3.752095 8.529894 8.450626 0.203958 -7.129918 4.249394 -6.112267 0
19999 -2.686903 1.961187 6.137088 2.600133 2.657241 -4.290882 -2.344267 0.974004 -1.027462 0.497421 ... 6.620811 -1.988786 -1.348901 3.951801 5.449706 -0.455411 -2.202056 1.678229 -1.974413 0

5 rows × 41 columns

In [5]:
#viewing the first five rows of df_test
df_test.head()
Out[5]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
0 -0.613489 -3.819640 2.202302 1.300420 -1.184929 -4.495964 -1.835817 4.722989 1.206140 -0.341909 ... 2.291204 -5.411388 0.870073 0.574479 4.157191 1.428093 -10.511342 0.454664 -1.448363 0
1 0.389608 -0.512341 0.527053 -2.576776 -1.016766 2.235112 -0.441301 -4.405744 -0.332869 1.966794 ... -2.474936 2.493582 0.315165 2.059288 0.683859 -0.485452 5.128350 1.720744 -1.488235 0
2 -0.874861 -0.640632 4.084202 -1.590454 0.525855 -1.957592 -0.695367 1.347309 -1.732348 0.466500 ... -1.318888 -2.997464 0.459664 0.619774 5.631504 1.323512 -1.752154 1.808302 1.675748 0
3 0.238384 1.458607 4.014528 2.534478 1.196987 -3.117330 -0.924035 0.269493 1.322436 0.702345 ... 3.517918 -3.074085 -0.284220 0.954576 3.029331 -1.367198 -3.412140 0.906000 -2.450889 0
4 5.828225 2.768260 -1.234530 2.809264 -1.641648 -1.406698 0.568643 0.965043 1.918379 -2.774855 ... 1.773841 -1.501573 -2.226702 4.776830 -6.559698 -0.805551 -0.276007 -3.858207 -0.537694 0

5 rows × 41 columns

In [6]:
#viewing the last five rows of df_test
df_test.tail()
Out[6]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
4995 -5.120451 1.634804 1.251259 4.035944 3.291204 -2.932230 -1.328662 1.754066 -2.984586 1.248633 ... 9.979118 0.063438 0.217281 3.036388 2.109323 -0.557433 1.938718 0.512674 -2.694194 0
4996 -5.172498 1.171653 1.579105 1.219922 2.529627 -0.668648 -2.618321 -2.000545 0.633791 -0.578938 ... 4.423900 2.603811 -2.152170 0.917401 2.156586 0.466963 0.470120 2.196756 -2.376515 0
4997 -1.114136 -0.403576 -1.764875 -5.879475 3.571558 3.710802 -2.482952 -0.307614 -0.921945 -2.999141 ... 3.791778 7.481506 -10.061396 -0.387166 1.848509 1.818248 -1.245633 -1.260876 7.474682 0
4998 -1.703241 0.614650 6.220503 -0.104132 0.955916 -3.278706 -1.633855 -0.103936 1.388152 -1.065622 ... -4.100352 -5.949325 0.550372 -1.573640 6.823936 2.139307 -4.036164 3.436051 0.579249 0
4999 -0.603701 0.959550 -0.720995 8.229574 -1.815610 -2.275547 -2.574524 -1.041479 4.129645 -2.731288 ... 2.369776 -1.062408 0.790772 4.951955 -7.440825 -0.069506 -0.918083 -2.291154 -5.362891 0

5 rows × 41 columns

Observations:¶

  • A mentioned, the predictor variables are encoded since the information collected by the sensors is confidential.
  • Reach row seems to represent a wind turbine.
  • The target variable consists of 1 (representing failure) and 0 (representing no failure).
In [7]:
#viewing number of rows and columns in df_train
df_train.shape
Out[7]:
(20000, 41)
In [8]:
#viewing number of rows and columns in df_test
df_test.shape
Out[8]:
(5000, 41)

Observations:¶

  • Both df_train and df_test have 41 columns.
  • df_train has 20,000 rows and df_test has 5,000 rows. So, 20% of the data is in df_test, and 80% of the data is in df_train.
In [9]:
#viewing datatypes and column information for df_train
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      19982 non-null  float64
 1   V2      19982 non-null  float64
 2   V3      20000 non-null  float64
 3   V4      20000 non-null  float64
 4   V5      20000 non-null  float64
 5   V6      20000 non-null  float64
 6   V7      20000 non-null  float64
 7   V8      20000 non-null  float64
 8   V9      20000 non-null  float64
 9   V10     20000 non-null  float64
 10  V11     20000 non-null  float64
 11  V12     20000 non-null  float64
 12  V13     20000 non-null  float64
 13  V14     20000 non-null  float64
 14  V15     20000 non-null  float64
 15  V16     20000 non-null  float64
 16  V17     20000 non-null  float64
 17  V18     20000 non-null  float64
 18  V19     20000 non-null  float64
 19  V20     20000 non-null  float64
 20  V21     20000 non-null  float64
 21  V22     20000 non-null  float64
 22  V23     20000 non-null  float64
 23  V24     20000 non-null  float64
 24  V25     20000 non-null  float64
 25  V26     20000 non-null  float64
 26  V27     20000 non-null  float64
 27  V28     20000 non-null  float64
 28  V29     20000 non-null  float64
 29  V30     20000 non-null  float64
 30  V31     20000 non-null  float64
 31  V32     20000 non-null  float64
 32  V33     20000 non-null  float64
 33  V34     20000 non-null  float64
 34  V35     20000 non-null  float64
 35  V36     20000 non-null  float64
 36  V37     20000 non-null  float64
 37  V38     20000 non-null  float64
 38  V39     20000 non-null  float64
 39  V40     20000 non-null  float64
 40  Target  20000 non-null  int64  
dtypes: float64(40), int64(1)
memory usage: 6.3 MB
In [10]:
#viewing datatypes and column information for df_test
df_test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 41 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      4995 non-null   float64
 1   V2      4994 non-null   float64
 2   V3      5000 non-null   float64
 3   V4      5000 non-null   float64
 4   V5      5000 non-null   float64
 5   V6      5000 non-null   float64
 6   V7      5000 non-null   float64
 7   V8      5000 non-null   float64
 8   V9      5000 non-null   float64
 9   V10     5000 non-null   float64
 10  V11     5000 non-null   float64
 11  V12     5000 non-null   float64
 12  V13     5000 non-null   float64
 13  V14     5000 non-null   float64
 14  V15     5000 non-null   float64
 15  V16     5000 non-null   float64
 16  V17     5000 non-null   float64
 17  V18     5000 non-null   float64
 18  V19     5000 non-null   float64
 19  V20     5000 non-null   float64
 20  V21     5000 non-null   float64
 21  V22     5000 non-null   float64
 22  V23     5000 non-null   float64
 23  V24     5000 non-null   float64
 24  V25     5000 non-null   float64
 25  V26     5000 non-null   float64
 26  V27     5000 non-null   float64
 27  V28     5000 non-null   float64
 28  V29     5000 non-null   float64
 29  V30     5000 non-null   float64
 30  V31     5000 non-null   float64
 31  V32     5000 non-null   float64
 32  V33     5000 non-null   float64
 33  V34     5000 non-null   float64
 34  V35     5000 non-null   float64
 35  V36     5000 non-null   float64
 36  V37     5000 non-null   float64
 37  V38     5000 non-null   float64
 38  V39     5000 non-null   float64
 39  V40     5000 non-null   float64
 40  Target  5000 non-null   int64  
dtypes: float64(40), int64(1)
memory usage: 1.6 MB

Observations:¶

  • All of the columns, except the target variable, are float datatypes.
  • The target variable is an integer data type.
  • In both df_train and df_test, there seem to be a few missing values in the columns V1 and V2.
In [11]:
#viewing a statistical summary for the columns in df_train
df_train.describe()
Out[11]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
count 19982.000000 19982.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 ... 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000 20000.000000
mean -0.271996 0.440430 2.484699 -0.083152 -0.053752 -0.995443 -0.879325 -0.548195 -0.016808 -0.012998 ... 0.303799 0.049825 -0.462702 2.229620 1.514809 0.011316 -0.344025 0.890653 -0.875630 0.055500
std 3.441625 3.150784 3.388963 3.431595 2.104801 2.040970 1.761626 3.295756 2.160568 2.193201 ... 5.500400 3.575285 3.183841 2.937102 3.800860 1.788165 3.948147 1.753054 3.012155 0.228959
min -11.876451 -12.319951 -10.708139 -15.082052 -8.603361 -10.227147 -7.949681 -15.657561 -8.596313 -9.853957 ... -19.876502 -16.898353 -17.985094 -15.349803 -14.833178 -5.478350 -17.375002 -6.438880 -11.023935 0.000000
25% -2.737146 -1.640674 0.206860 -2.347660 -1.535607 -2.347238 -2.030926 -2.642665 -1.494973 -1.411212 ... -3.420469 -2.242857 -2.136984 0.336191 -0.943809 -1.255819 -2.987638 -0.272250 -2.940193 0.000000
50% -0.747917 0.471536 2.255786 -0.135241 -0.101952 -1.000515 -0.917179 -0.389085 -0.067597 0.100973 ... 0.052073 -0.066249 -0.255008 2.098633 1.566526 -0.128435 -0.316849 0.919261 -0.920806 0.000000
75% 1.840112 2.543967 4.566165 2.130615 1.340480 0.380330 0.223695 1.722965 1.409203 1.477045 ... 3.761722 2.255134 1.436935 4.064358 3.983939 1.175533 2.279399 2.057540 1.119897 0.000000
max 15.493002 13.089269 17.090919 13.236381 8.133797 6.975847 8.006091 11.679495 8.137580 8.108472 ... 23.633187 16.692486 14.358213 15.291065 19.329576 7.467006 15.289923 7.759877 10.654265 1.000000

8 rows × 41 columns

In [12]:
#viewing a statistical summary for the columns in df_test
df_test.describe()
Out[12]:
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V32 V33 V34 V35 V36 V37 V38 V39 V40 Target
count 4995.000000 4994.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 ... 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000 5000.000000
mean -0.277622 0.397928 2.551787 -0.048943 -0.080120 -1.042138 -0.907922 -0.574592 0.030121 0.018524 ... 0.232567 -0.080115 -0.392663 2.211205 1.594845 0.022931 -0.405659 0.938800 -0.932406 0.056400
std 3.466280 3.139562 3.326607 3.413937 2.110870 2.005444 1.769017 3.331911 2.174139 2.145437 ... 5.585628 3.538624 3.166101 2.948426 3.774970 1.785320 3.968936 1.716502 2.978193 0.230716
min -12.381696 -10.716179 -9.237940 -14.682446 -7.711569 -8.924196 -8.124230 -12.252731 -6.785495 -8.170956 ... -17.244168 -14.903781 -14.699725 -12.260591 -12.735567 -5.079070 -15.334533 -5.451050 -10.076234 0.000000
25% -2.743691 -1.649211 0.314931 -2.292694 -1.615238 -2.368853 -2.054259 -2.642088 -1.455712 -1.353320 ... -3.556267 -2.348121 -2.009604 0.321818 -0.866066 -1.240526 -2.984480 -0.208024 -2.986587 0.000000
50% -0.764767 0.427369 2.260428 -0.145753 -0.131890 -1.048571 -0.939695 -0.357943 -0.079891 0.166292 ... -0.076694 -0.159713 -0.171745 2.111750 1.702964 -0.110415 -0.381162 0.959152 -1.002764 0.000000
75% 1.831313 2.444486 4.587000 2.166468 1.341197 0.307555 0.212228 1.712896 1.449548 1.511248 ... 3.751857 2.099160 1.465402 4.031639 4.104409 1.237522 2.287998 2.130769 1.079738 0.000000
max 13.504352 14.079073 15.314503 12.140157 7.672835 5.067685 7.616182 10.414722 8.850720 6.598728 ... 26.539391 13.323517 12.146302 13.489237 17.116122 6.809938 13.064950 7.182237 8.698460 1.000000

8 rows × 41 columns

Observations¶

  • The statistical summary contains similar values for df_train and df_test which shows that the data in df_test has a similar distribution.
  • The count also seems to indicate that there are missing values in V1 and V2 in both df_train and df_test.
In [13]:
#viewing duplicate values in df_train
df_train.duplicated().sum()
Out[13]:
0
In [14]:
#viewing duplicate values in df_test
df_test.duplicated().sum()
Out[14]:
0

Observations:¶

There are no duplicate values in df_train and df_test.

In [15]:
#viewing missing values in df_train
df_train.isnull().sum()
Out[15]:
V1        18
V2        18
V3         0
V4         0
V5         0
V6         0
V7         0
V8         0
V9         0
V10        0
V11        0
V12        0
V13        0
V14        0
V15        0
V16        0
V17        0
V18        0
V19        0
V20        0
V21        0
V22        0
V23        0
V24        0
V25        0
V26        0
V27        0
V28        0
V29        0
V30        0
V31        0
V32        0
V33        0
V34        0
V35        0
V36        0
V37        0
V38        0
V39        0
V40        0
Target     0
dtype: int64
In [16]:
#viewing missing values in df_test
df_test.isnull().sum()
Out[16]:
V1        5
V2        6
V3        0
V4        0
V5        0
V6        0
V7        0
V8        0
V9        0
V10       0
V11       0
V12       0
V13       0
V14       0
V15       0
V16       0
V17       0
V18       0
V19       0
V20       0
V21       0
V22       0
V23       0
V24       0
V25       0
V26       0
V27       0
V28       0
V29       0
V30       0
V31       0
V32       0
V33       0
V34       0
V35       0
V36       0
V37       0
V38       0
V39       0
V40       0
Target    0
dtype: int64

Observations:¶

  • There are missing values in df_train and df_test that need to be treated later on.
  • df_train has 18 missing values in V1 and 18 missing values in V2.
  • df_test has 5 missing values in V1 and 6 missing values in V2.

Exploratory Data Analysis (EDA)¶

Univariate Analysis¶

Plotting histograms and boxplots for all the variables¶

In [17]:
# function to plot a boxplot and a histogram along the same scale.


def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined

    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to the show density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={'height_ratios': (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color='violet'
    )  # boxplot will be created and a star will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette='winter'
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color='green', linestyle='--'
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color='black', linestyle='-'
    )  # Add median to the histogram

Plotting all the features at one go¶

In [18]:
for feature in df_train.columns: #for each column in df_train
    histogram_boxplot(df_train, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
In [19]:
for feature in df_test.columns: #for each column in df_test
    histogram_boxplot(df_test, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis

Observations:¶

  • The distribution for all of the individual variables, except the target variable, in df_train and df_test is in the shape of a bell curve, which indicates that the data for each column (except target) is close to a normal distribution with a small degree of skewness.
  • For the target variable, there is a much higher sample size for not failed values than failed values. The failed values make up the minority class and the not failed values make up the majority class.
  • There are a few outliers, as seen in the box plots, but they represent authentic data points. Manipulating them will also manipulate authentic data, so it is best to leave them as they are.

Bivariate Analysis¶

In [20]:
#making a heatmap using columns from df_train to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(df_train.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
In [21]:
#making a heatmap using columns from df_test to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(df_test.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap

Observations:¶

  • In df_train and df_test, the target variable does not have any correlation with the other variables.
  • In df_train and df_test, V2 and V14 seem to have the highest negative correlation, but it is not less than -0.90.
  • In df_train and df_test, V7 and V15 seem to have the highest positive correlation, but it is not greater than 0.90.

Data Pre-processing¶

Feature Engineering¶

There are no columns that need to be dropped. Although there are some correlated columns, the correlation is not greater than 0.90 or less than -0.90. Also, the column names are hidden/encoded, so it is unknown which columns are related to each other or which columns belong together. It is best not to drop them.

There are no categorical variables that need to be encoded. All of the columns, except the target variable, are float datatypes. The target column is an integer datatype.

The values of the target variable are already in 0 and 1 format, so there is no need to change them.

Outlier Detection and Treatment¶

In the univariate analysis section, it can be seen that there are outliers. They are authentic data points, so they will not be changed.

Preparing Data for Modeling¶

Since the test set is already separate, there is no need to split data into train and test.

The test set needs to be organized into X_test and y_test.

However, the training set needs to be split into training and validation.

In [22]:
#adding all columns of df_test, except Target, to X_test
X_test = df_test.drop(['Target'], axis=1)
#adding Target column to y_test
y_test = df_test['Target']
#adding all columns of df_train, except Target, to X
X = df_train.drop(['Target'], axis=1)
#adding Target column to Y
Y = df_train['Target']
In [23]:
#dividing data in df_train into train and validation set using a test_size of 0.25
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.25, random_state=1, stratify=Y)
In [24]:
#checking number of columns in each set
print(X_train.shape, X_val.shape, X_test.shape, y_train.shape, y_val.shape, y_test.shape)
(15000, 40) (5000, 40) (5000, 40) (15000,) (5000,) (5000,)

Observations:¶

The data in df_train has been split into training and validation sets. The df_test set has been organized into X_test and y_test. There are 15,000 rows in the training set. In the testing and validation sets, there are 5,000 rows each.

Missing Value Imputation¶

The V1 and V2 columns are both a float datatype. Since the V1 and V2 columns have outliers in the df_train and df_test set, it is best to impute the missing values with the median.

To ensure no data leakage, the imputation of missing values is completed after the splitting of data.

In [25]:
#assigning imputer to SimpleImputer and setting imputing strategy to median
imputer = SimpleImputer(strategy='median')
In [26]:
#fitting and using imputer to transform the X_train set
X_train = pd.DataFrame(imputer.fit_transform(X_train), columns=X_train.columns)
#using imputer to transform the X_val set
X_val = pd.DataFrame(imputer.transform(X_val), columns=X_val.columns)
#using imputer to transform the X_test set
X_test = pd.DataFrame(imputer.transform(X_test), columns=X_test.columns)
In [27]:
#viewing missing values in X_train
X_train.isnull().sum()
Out[27]:
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
In [28]:
#viewing missing values in X_val
X_val.isnull().sum()
Out[28]:
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
In [29]:
#viewing missing values in X_test
X_test.isnull().sum()
Out[29]:
V1     0
V2     0
V3     0
V4     0
V5     0
V6     0
V7     0
V8     0
V9     0
V10    0
V11    0
V12    0
V13    0
V14    0
V15    0
V16    0
V17    0
V18    0
V19    0
V20    0
V21    0
V22    0
V23    0
V24    0
V25    0
V26    0
V27    0
V28    0
V29    0
V30    0
V31    0
V32    0
V33    0
V34    0
V35    0
V36    0
V37    0
V38    0
V39    0
V40    0
dtype: int64
In [30]:
#viewing missing values in y_train, y_val, and y_test
print(y_train.isnull().sum(),y_val.isnull().sum(),y_test.isnull().sum())
0 0 0

Observations:¶

There are no missing values anymore in the training, testing or validation sets.

Exploratory Data Analysis (After Data Pre-processing)¶

In [31]:
for feature in X_train.columns: #for each column in X_train
    histogram_boxplot(X_train, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
In [32]:
for feature in X_val.columns: #for each column in X_val
    histogram_boxplot(X_val, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis
In [33]:
for feature in X_test.columns: #for each column in X_test
    histogram_boxplot(X_test, feature, figsize=(12, 7), kde=False, bins=None) #plot column on x-axis

Observations:¶

After imputing missing values, there are no significant changes in the distributions of each variable in the training, validation or testing sets.

In [34]:
#making a heatmap using columns from X_train to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(X_train.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
In [35]:
#making a heatmap using columns from X_val to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(X_val.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap
In [36]:
#making a heatmap using columns from X_test to view correlation
#correlation range is from -1 to 1, labels are displayed and limited to 2 decimal places
plt.subplots(figsize=(20,15)) #adjusting size of heatmap to make sure all variables can be seen
sns.heatmap(X_test.corr(), annot=True, vmin=-1, vmax=1, fmt='.2f')
plt.show(); #displaying heatmap

Observations:¶

After imputing missing values, there are no significant changes in the correlation in the training, validation or testing sets.

Model Building¶

Model evaluation criterion¶

The nature of predictions made by the classification model will translate as follows:

  • True positives (TP) are failures correctly predicted by the model.
  • False negatives (FN) are real failures in a generator where there is no detection by model.
  • False positives (FP) are failure detections in a generator where there is no failure.

Which metric to optimize?

  • We need to choose the metric which will ensure that the maximum number of generator failures are predicted correctly by the model.
  • We would want Recall to be maximized as greater the Recall, the higher the chances of minimizing false negatives.
  • We want to minimize false negatives because if a model predicts that a machine will have no failure when there will be a failure, it will increase the maintenance cost.

Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.

In [37]:
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    model: classifier
    predictors: independent variables
    target: dependent variable
    """

    # predicting using the independent variables
    pred = model.predict(predictors)

    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            'Accuracy': acc,
            'Recall': recall,
            'Precision': precision,
            'F1': f1
            
        },
        index=[0],
    )

    return df_perf

Defining scorer to be used for cross-validation and hyperparameter tuning¶

  • We want to reduce false negatives and will try to maximize "Recall".
  • To maximize Recall, we can use Recall as a scorer in cross-validation and hyperparameter tuning.
In [38]:
# Type of scoring used to compare parameter combinations
scorer = metrics.make_scorer(metrics.recall_score)

Model Building with original data¶

Sample Decision Tree model building with original data

In [39]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(('Logistic Regression', LogisticRegression(random_state=1)))
models.append(('Decision Tree', DecisionTreeClassifier(random_state=1)))
models.append(('Random Forest', RandomForestClassifier(random_state=1)))
models.append(('Bagging', BaggingClassifier(random_state=1)))
models.append(('Adaboost', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting', GradientBoostingClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print('\n' 'Cross-Validation performance on training dataset:' '\n')

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train, y=y_train, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print('{}: {}'.format(name, cv_result.mean()))

print('\n' 'Validation Performance:' '\n')

for name, model in models:
    model.fit(X_train, y_train)
    scores = recall_score(y_val, model.predict(X_val))
    print('{}: {}'.format(name, scores))
Cross-Validation performance on training dataset:

Logistic Regression: 0.4927566553639709
Decision Tree: 0.6982829521679532
Random Forest: 0.7235192266070268
Bagging: 0.7210807301060529
Adaboost: 0.6309140754635308
Gradient Boosting: 0.7066661857008874

Validation Performance:

Logistic Regression: 0.48201438848920863
Decision Tree: 0.7050359712230215
Random Forest: 0.7266187050359713
Bagging: 0.7302158273381295
Adaboost: 0.6762589928057554
Gradient Boosting: 0.7230215827338129

Model Building with Oversampled data¶

In [40]:
# Synthetic Minority Over Sampling Technique
sm = SMOTE(sampling_strategy=1, k_neighbors=5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train)
In [41]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(('Logistic Regression - Oversampled Data', LogisticRegression(random_state=1)))
models.append(('Decision Tree - Oversampled Data', DecisionTreeClassifier(random_state=1)))
models.append(('Random Forest - Oversampled Data', RandomForestClassifier(random_state=1)))
models.append(('Bagging - Oversampled Data', BaggingClassifier(random_state=1)))
models.append(('Adaboost - Oversampled Data', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting - Oversampled Data', GradientBoostingClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print('\n' 'Cross-Validation performance on training dataset:' '\n')

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_over, y=y_train_over, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print('{}: {}'.format(name, cv_result.mean()))

print('\n' 'Validation Performance:' '\n')

for name, model in models:
    model.fit(X_train_over, y_train_over)
    scores = recall_score(y_val, model.predict(X_val))
    print('{}: {}'.format(name, scores))
Cross-Validation performance on training dataset:

Logistic Regression - Oversampled Data: 0.883963699328486
Decision Tree - Oversampled Data: 0.9720494245534969
Random Forest - Oversampled Data: 0.9839075260047615
Bagging - Oversampled Data: 0.9762141471581656
Adaboost - Oversampled Data: 0.8978689011775473
Gradient Boosting - Oversampled Data: 0.9256068151319724

Validation Performance:

Logistic Regression - Oversampled Data: 0.8489208633093526
Decision Tree - Oversampled Data: 0.7769784172661871
Random Forest - Oversampled Data: 0.8489208633093526
Bagging - Oversampled Data: 0.8345323741007195
Adaboost - Oversampled Data: 0.8561151079136691
Gradient Boosting - Oversampled Data: 0.8776978417266187

Model Building with Undersampled data¶

In [42]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [43]:
models = []  # Empty list to store all the models

# Appending models into the list
models.append(('Logistic Regression - Undersampled Data', LogisticRegression(random_state=1)))
models.append(('Decision Tree - Undersampled Data', DecisionTreeClassifier(random_state=1)))
models.append(('Random Forest - Undersampled Data', RandomForestClassifier(random_state=1)))
models.append(('Bagging - Undersampled Data', BaggingClassifier(random_state=1)))
models.append(('Adaboost - Undersampled Data', AdaBoostClassifier(random_state=1)))
models.append(('Gradient Boosting - Undersampled Data', GradientBoostingClassifier(random_state=1)))

results1 = []  # Empty list to store all model's CV scores
names = []  # Empty list to store name of the models


# loop through all models to get the mean cross validated score
print('\n' 'Cross-Validation performance on training dataset:' '\n')

for name, model in models:
    kfold = StratifiedKFold(
        n_splits=5, shuffle=True, random_state=1
    )  # Setting number of splits equal to 5
    cv_result = cross_val_score(
        estimator=model, X=X_train_un, y=y_train_un, scoring=scorer, cv=kfold
    )
    results1.append(cv_result)
    names.append(name)
    print('{}: {}'.format(name, cv_result.mean()))

print('\n' 'Validation Performance:' '\n')

for name, model in models:
    model.fit(X_train_un, y_train_un)
    scores = recall_score(y_val, model.predict(X_val))
    print('{}: {}'.format(name, scores))
Cross-Validation performance on training dataset:

Logistic Regression - Undersampled Data: 0.8726138085275232
Decision Tree - Undersampled Data: 0.8617776495202367
Random Forest - Undersampled Data: 0.9038669648654498
Bagging - Undersampled Data: 0.8641945025611427
Adaboost - Undersampled Data: 0.8666113556020489
Gradient Boosting - Undersampled Data: 0.8978572974532861

Validation Performance:

Logistic Regression - Undersampled Data: 0.8525179856115108
Decision Tree - Undersampled Data: 0.841726618705036
Random Forest - Undersampled Data: 0.8920863309352518
Bagging - Undersampled Data: 0.8705035971223022
Adaboost - Undersampled Data: 0.8489208633093526
Gradient Boosting - Undersampled Data: 0.8884892086330936

Observations and Choosing the Three Best models:¶

  • The Random Forest model on the normal data is the most consistent on both the training and testing set. The recall value for the validation and training set is around 0.72. It is one of the best models because it has the best generalization. It can perform consistently on other datasets, and with some tuning, the recall value can be improved.
  • Compared to the models on the original data, the models on the oversampled and undersampled data have higher recall scores. It makes sense that two of the best models should be from the oversampled or undersampled data.
  • The Random Forest - Undersampled Data model has the highest recall score on the validation set. On the training set, the recall is 0.90. On the validation set, the recall is 0.89.
  • The Gradient Boosting - Undersampled Data model has the second highest recall score on the validation set. On the training set, the recall is 0.89. On the validation set, the recall is 0.88.

HyperparameterTuning¶

Tuning Random Forest Model¶

In [44]:
# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': [200,250,300],
              'min_samples_leaf': np.arange(1,4), 
              'max_features' : [np.arange(0.3,0.6,0.1),'sqrt'],
              'max_samples': np.arange(0.4,0.7,0.1) }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train,y_train)

#creating model with the best combination found
tuned_random_forest = randomized_cv.best_estimator_
#applying the combination of parameters on X_train and y_train
tuned_random_forest.fit(X_train, y_train)
Out[44]:
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_samples=0.6, n_estimators=250, random_state=1)
In [45]:
#calculating metrics on training set using predefined function/saving them to new variable
tuned_random_metrics_training = model_performance_classification_sklearn(tuned_random_forest, X_train, y_train)
print('Training Metrics')
print(tuned_random_metrics_training) #printing tuned_random_metrics_training
Training Metrics
   Accuracy    Recall  Precision        F1
0  0.994933  0.908654        1.0  0.952141
In [46]:
#confusion matrix
X_train_pred = tuned_random_forest.predict(X_train) #making predictions for X_train using model
matrix_train = confusion_matrix(y_train, X_train_pred) #building confusion matrix with y_train and X_train_pred
sns.heatmap(matrix_train, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Training Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
In [47]:
#calculating metrics on validation set using predefined function/saving them to new variable
tuned_random_metrics_val = model_performance_classification_sklearn(tuned_random_forest, X_val, y_val)
print('Validation Metrics')
print(tuned_random_metrics_val) #printing tuned_random_metrics_val
Validation Metrics
   Accuracy   Recall  Precision        F1
0    0.9834  0.71223   0.985075  0.826722
In [48]:
#confusion matrix
X_val_pred = tuned_random_forest.predict(X_val) #making predictions for X_val using model
matrix_val = confusion_matrix(y_val, X_val_pred) #building confusion matrix with y_val and X_val_pred
sns.heatmap(matrix_val, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Validation Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap

Observations:¶

The recall for training increased from 0.72 to 0.90, but the recall for validation decreased from 0.72 to 0.71. In addition to that, there is a large difference between the recall scores for training and validation. This model seems to overfit the training data. It does not perform as consistently on the validation data.

Tuning Random Forest - Undersampled Data model¶

In [49]:
# defining model
Model = RandomForestClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'max_depth': np.arange(2,20),
              'min_samples_leaf': [1, 2, 5, 7], 
              'max_leaf_nodes' : [5, 10,15],
              'min_impurity_decrease': [0.0001,0.001] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_un,y_train_un)

#creating model with the best combination found
tuned_random_forest_under = randomized_cv.best_estimator_
#applying the combination of parameters on X_train_un and y_train_un
tuned_random_forest_under.fit(X_train_un, y_train_un)
Out[49]:
RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, random_state=1)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                       min_impurity_decrease=0.001, random_state=1)
In [50]:
#calculating metrics on training set using predefined function/saving them to new variable
tuned_random_under_metrics_training = model_performance_classification_sklearn(tuned_random_forest_under, X_train, y_train)
print('Training Metrics')
print(tuned_random_under_metrics_training) #printing tuned_random_under_metrics_training
Training Metrics
   Accuracy    Recall  Precision        F1
0  0.912867  0.907452   0.380353  0.536031
In [51]:
#confusion matrix
X_train_pred = tuned_random_forest_under.predict(X_train) #making predictions for X_train using model
matrix_train = confusion_matrix(y_train, X_train_pred) #building confusion matrix with y_train and X_train_pred
sns.heatmap(matrix_train, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Training Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
In [52]:
#calculating metrics on validation set using predefined function/saving them to new variable
tuned_random_under_metrics_val = model_performance_classification_sklearn(tuned_random_forest_under, X_val, y_val)
print('Validation Metrics')
print(tuned_random_under_metrics_val) #printing tuned_random_under_metrics_val
Validation Metrics
   Accuracy    Recall  Precision        F1
0    0.9012  0.884892   0.347458  0.498986
In [53]:
#confusion matrix
X_val_pred = tuned_random_forest_under.predict(X_val) #making predictions for X_val using model
matrix_val = confusion_matrix(y_val, X_val_pred) #building confusion matrix with y_val and X_val_pred
sns.heatmap(matrix_val, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Validation Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap

Observations:¶

The recall for training remained the same at 0.90, but the recall for validation decreased from 0.89 to 0.88. However, that is still a decent score. The training and validation have a similar recall value, so the model is performing consistently across both datasets.

Tuning Gradient Boosting - Undersampled Data Model¶

In [54]:
# defining model
Model = GradientBoostingClassifier(random_state=1)

# Parameter grid to pass in RandomSearchCV
param_grid = {'n_estimators': np.arange(100,150,25),
              'learning_rate': [0.2,0.05,1], 
              'subsample' : [0.5,0.7],
              'max_features': [0.5,0.7] }

#Calling RandomizedSearchCV
randomized_cv = RandomizedSearchCV(estimator=Model, param_distributions=param_grid, n_iter=10, n_jobs = -1, scoring=scorer, cv=5, random_state=1)

#Fitting parameters in RandomizedSearchCV
randomized_cv.fit(X_train_over,y_train_over)

#creating model with the best combination found
tuned_gradient_under = randomized_cv.best_estimator_
#applying the combination of parameters on X_train_un and y_train_un
tuned_gradient_under.fit(X_train_un, y_train_un)
Out[54]:
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           random_state=1, subsample=0.7)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
GradientBoostingClassifier(learning_rate=1, max_features=0.5, n_estimators=125,
                           random_state=1, subsample=0.7)
In [55]:
#calculating metrics on training set using predefined function/saving them to new variable
tuned_gradient_under_metrics_training = model_performance_classification_sklearn(tuned_gradient_under, X_train, y_train)
print('Training Metrics')
print(tuned_gradient_under_metrics_training) #printing tuned_gradient_under_metrics_training
Training Metrics
   Accuracy    Recall  Precision        F1
0  0.887867  0.981971   0.328905  0.492762
In [56]:
#confusion matrix
X_train_pred = tuned_gradient_under.predict(X_train) #making predictions for X_train using model
matrix_train = confusion_matrix(y_train, X_train_pred) #building confusion matrix with y_train and X_train_pred
sns.heatmap(matrix_train, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Training Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
In [57]:
#calculating metrics on validation set using predefined function/saving them to new variable
tuned_gradient_under_metrics_val = model_performance_classification_sklearn(tuned_gradient_under, X_val, y_val)
print('Validation Metrics')
print(tuned_gradient_under_metrics_val) #printing tuned_gradient_under_metrics_val
Validation Metrics
   Accuracy    Recall  Precision        F1
0    0.8792  0.874101   0.299261  0.445872
In [58]:
#confusion matrix
X_val_pred = tuned_gradient_under.predict(X_val) #making predictions for X_val using model
matrix_val = confusion_matrix(y_val, X_val_pred) #building confusion matrix with y_val and X_val_pred
sns.heatmap(matrix_val, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Validation Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap

Observations:¶

The recall for training increased from 0.89 to 0.98, but the recall for validation decreased from 0.88 to 0.87. Although it is a decent score, there is a large difference in the training and validation recall scores. The model is not generalizing as well to the validation set.

Model performance comparison and choosing the final model¶

Tuned Random Forest Performance¶

In [59]:
print(tuned_random_metrics_training)
   Accuracy    Recall  Precision        F1
0  0.994933  0.908654        1.0  0.952141
In [60]:
print(tuned_random_metrics_val)
   Accuracy   Recall  Precision        F1
0    0.9834  0.71223   0.985075  0.826722

Tuned Random Forest - Undersampled Data Performance¶

In [61]:
print(tuned_random_under_metrics_training)
   Accuracy    Recall  Precision        F1
0  0.912867  0.907452   0.380353  0.536031
In [62]:
print(tuned_random_under_metrics_val)
   Accuracy    Recall  Precision        F1
0    0.9012  0.884892   0.347458  0.498986

Tuned Gradient Boosting - Undersampled Data Performance¶

In [63]:
print(tuned_gradient_under_metrics_training)
   Accuracy    Recall  Precision        F1
0  0.887867  0.981971   0.328905  0.492762
In [64]:
print(tuned_gradient_under_metrics_val)
   Accuracy    Recall  Precision        F1
0    0.8792  0.874101   0.299261  0.445872

Choosing the Final model:¶

  • The Tuned Random Forest model and Tuned Gradient Boosting - Undersampled Data model both have a larger difference in the training and validation scores, when compared to the the Tuned Random Forest - Undersampled Data model. They both do not perform as consistently on the validation set as they do on the training set. They will not be able to generalize as well on other datasets.

The Tuned Random Forest - Undersampled Data model has a decent recall score on the training and testing set. The scores are also very similar for both. This model will perform consistently on different datasets, so this is the final model.

Final Model Performance on Test Set: Tuned Random Forest - Undersampled Data Model¶

In [65]:
#confusion matrix
X_test_pred = tuned_random_forest_under.predict(X_test) #making predictions for X_test using model
matrix_test = confusion_matrix(y_test, X_test_pred) #building confusion matrix with y_test and X_test_pred
sns.heatmap(matrix_test, fmt='g', annot=True) #creating heatmap for matrix/annotation labels displayed as full numbers
plt.title('Confusion matrix - Testing Set') #heatmap title
plt.xlabel('Predicted Values') #x-axis title
plt.ylabel('Actual Values') #y-axis title
plt.show(); #displaying heatmap
In [66]:
#calculating metrics on testing set using predefined function/saving them to new variable
tuned_random_forest_under_metrics_test = model_performance_classification_sklearn(tuned_random_forest_under, X_test, y_test)
print('Testing Metrics')
print(tuned_random_forest_under_metrics_test) #printing tuned_random_forest_under_metrics_test
Testing Metrics
   Accuracy   Recall  Precision        F1
0    0.9128  0.85461   0.378931  0.525054
In [67]:
#viewing training metrics again
tuned_random_under_metrics_training
Out[67]:
Accuracy Recall Precision F1
0 0.912867 0.907452 0.380353 0.536031

Observations:¶

  • The recall score for testing is decent at 0.85. It is not too far from the training set recall score, which is 0.90.
  • The accuracy is also decent at 0.91.
  • Precision and F1 score are not as high in both testing and training, but in this particular situation, there needs to be a larger focus on recall. So, overall, the model's performance is decent on the testing set.

Pipeline for the Final Model¶

In [68]:
#creating a pipeline called pipeline
#first step is Imputer where missing values are imported
#second step is RFU where final model is created (random forest (undersampled data))
pipeline = Pipeline(
     steps=[
         ('Imputer', imputer),
         ('RFU', RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                       min_impurity_decrease=0.001
         ),
         ),
     ]
)
In [69]:
#adding all columns of df_test, except Target, to X_test
X_test = df_test.drop(['Target'], axis=1)
#adding Target column to y_test
y_test = df_test['Target']
#adding all columns of df_train, except Target, to X
X = df_train.drop(['Target'], axis=1)
#adding Target column to Y
Y = df_train['Target']
In [70]:
#fitting and using imputer to transform the X set
imputer = SimpleImputer(strategy='median')
X = imputer.fit_transform(X)
In [71]:
# Random undersampler for under sampling the data
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_un, y_train_un = rus.fit_resample(X_train, y_train)
In [72]:
#fitting pipeline on undersampled training data
pipeline.fit(X_train_un, y_train_un)
Out[72]:
Pipeline(steps=[('Imputer', SimpleImputer(strategy='median')),
                ('RFU',
                 RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                                        min_impurity_decrease=0.001))])
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Pipeline(steps=[('Imputer', SimpleImputer(strategy='median')),
                ('RFU',
                 RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                                        min_impurity_decrease=0.001))])
SimpleImputer(strategy='median')
RandomForestClassifier(max_depth=11, max_leaf_nodes=15,
                       min_impurity_decrease=0.001)
In [73]:
#viewing pipeline training metrics
#calculating metrics on training set using predefined function/saving them to new variable
pipeline_training = model_performance_classification_sklearn(pipeline, X_train, y_train)
print('Training Metrics')
print(pipeline_training) #printing pipeline_training
Training Metrics
   Accuracy   Recall  Precision        F1
0  0.910933  0.90024   0.374126  0.528582
In [74]:
#viewing pipeline testing metrics
#calculating metrics on testing set using predefined function/saving them to new variable
pipeline_test = model_performance_classification_sklearn(pipeline, X_test, y_test)
print('Testing Metrics')
print(pipeline_test) #printing pipeline_test
Testing Metrics
   Accuracy   Recall  Precision        F1
0    0.9068  0.85461   0.361862  0.508439

Business Insights and Conclusions¶

  • The recall score for the new model is very decent.
  • The new model performs at such a level that there are a lower number of false negatives and a higher number of true positives.
  • As a result, there will be lower replacement costs.
  • None of the columns are correlated with the target variable, so it cannot be stated that any variable is correlated with chances of failure.
  • However, some of the variables are positively or negatively correlated with each other. This could be due to variables that increase or decrease together like for example, temperature and climate. Exploring them further would require more data.